# EDA-Aware RTL Generation with Large Language Models

Mubashir ul Islam§, Humza Sami§, Pierre-Emmanuel Gaillardon‡§ and Valerio Tenace§

§PrimisAI, Los Gatos, CA, USA

‡University of Utah, Salt Lake City, UT, USA

Abstract: Large Language Models (LLMs) have become increasingly popular for generating RTL code. However, producing error-free RTL code in a zero-shot setting remains highly challenging even for state-of-the-art LLMs, often leading to issues that require manual, iterative refinement. This additional debugging process can dramatically increase the verification workload, underscoring the need for robust, automated correction mechanisms to ensure code correctness from the start.

In this work, we introduce AIVRIL2, a self-verifying, LLM-agnostic agentic framework aimed at enhancing RTL code generation through iterative corrections of both syntax and functional errors. Our approach leverages a collaborative multi-agent system that incorporates feedback from error logs generated by EDA tools to automatically identify and resolve design flaws. Experimental results, conducted on the VerilogEval-Human benchmark suite, demonstrate that our framework significantly improves code quality, achieving nearly a  $3.4\times$  enhancement over prior methods. In the best-case scenario, functional pass rates of 77% for Verilog and 66% for VHDL were obtained, thus substantially improving the reliability of LLM-driven RTL code generation.

Index Terms—Large language models, Multi-agent systems, Generative AI, Electronic design automation

#### I. INTRODUCTION

The rapid development of Artificial Intelligence (AI) has led to transformative changes across various industries, with Large Language Models (LLMs) standing out as powerful tools capable of generating human-like text and interpreting complex user instructions. Within the field of hardware design, LLMs hold the potential to revolutionize the entire design process, with projections suggesting that both front-end and back-end tasks could soon become fully automated [1], [2]. Among the tasks gaining significant attention is the automated generation of Register Transfer Level (RTL) code. By interpreting user intent with minimal human input, LLMs can streamline workflows, effectively bridging the gap between conceptual design and physical implementation, thereby ultimately enhancing productivity.

However, despite their potential, the probabilistic nature of LLMs poses critical challenges, particularly in *zero-shot prompting* scenarios. Without task-specific training, LLM outputs often contain syntactical and functional errors. While progress has been made in *Generative AI* (GenAI) for RTL design, concerns about the accuracy and reliability of generated code remain. Frequently, the output requires substantial manual correction [3], which diminishes the efficiency promised by this new technology and instead increases the verification burden. This manual, iterative correction process is time-consuming and can introduce additional errors, highlighting a critical gap in current solutions: the lack of robust, automated verification mechanisms built into GenAI solutions.

In response, recent efforts have explored integrating multi-agent systems into RTL code generation workflows. These systems utilize collaborative agents that leverage feedback from *Electronic Design Automation* (EDA) tools to iteratively debug both syntactical and functional errors [3], [4]. However, most current solutions are limited: they address isolated issues, such as syntax correction or partial debugging, without offering a fully integrated LLM-driven code generation pipeline. Additionally, many approaches are optimized for a specific RTL language, usually Verilog [1], [5], [6], which restricts their applicability to other hardware description languages.

In this work, we build upon our previous contribution [7] to introduce AIVRIL2, a self-verifying, LLM-agnostic agentic framework designed to enhance LLM-driven RTL code generation by iteratively correcting both syntactical and functional errors. The key contributions of this paper are threefold:

- We present a new two-stage testbench and RTL code generation pipeline, with the first stage focused on syntactical corrections and the second on functional corrections. Both stages, or optimization loops, integrate LLMs with EDA tools, enabling continuous feedback between agents and tools for progressive code refinement.
- We detail the roles and behaviors of each specialized LLM-based agent: the *Code Agent*, responsible for generating robust RTL code and comprehensive testbenches; the *Review Agent*, which interprets complex EDA logs to detect and correct syntactical errors; and the *Verification Agent*, tasked with analyzing functional traces to ensure design accuracy.
- We demonstrate that AIVRIL2 is fully orthogonal to the target RTL language and completely LLM-agnostic, making it adaptable to various scenarios, thus enhancing its versatility for different and diverse workflows.

Experimental results, conducted on the VerilogEval-Human benchmark suite, demonstrate that AIVRIL2 achieves a  $3.4\times$  improvement over existing solutions in the best case, with pass@1 rates of 77% for Verilog and 66% for VHDL, underscoring the effectiveness of our proposed language-agnostic design.

The remainder of this paper is organized as follows: Section II provides background information and discusses related work, highlighting the limitations of existing approaches. Section III details the internal mechanisms of AIVRIL2, including its multi-agent architecture and self-verification mechanisms. Section IV presents the experimental results, showcasing the advantages of our solutions over existing LLM-driven RTL generation techniques. Finally, Section V concludes the paper with a summary of the findings, suggesting potential directions for future research in enhancing automated RTL code verification.

#### II. BACKGROUND & RELATED WORK

Decision-making frameworks for multi-agent systems are set to substantially influence GenAI-driven methodologies in hardware design. This section provides an overview of how verbal reasoning and action planning interact in autonomous systems, underscoring their growing role in GenAI applications for RTL design. Recent breakthroughs and key challenges in this evolving field are also highlighted.

## A. Multi-Agent Systems

The integration of verbal reasoning with decision-making processes in autonomous systems has been a significant focus of recent research. LLMs have demonstrated their ability to manage multi-step reasoning, solving tasks such as arithmetic, commonsense reasoning, and symbolic operations [8]. By decomposing complex problems into sequential steps, an approach commonly referred to as chain-of-thought prompting, LLMs effectively enhance the reasoning process. However, a notable limitation of these models is their dependence on internal reasoning without factoring in any external data. As the reasoning chain lengthens, the risk of errors or hallucinations increases, which can compromise the reliability of the output. Without external validation, models may generate responses that, while seemingly plausible, are ultimately inaccurate.

To mitigate these shortcomings, recent efforts have explored LLM agents in interactive environments where predictions and action planning are based on real-time observations [9]. A notable paradigm in this domain is ReAct [10], which couples reasoning with subsequent actions. This combination enables the model to plan and monitor actions while adapting dynamically to inputs received from the environment it operates in. The reasoning process supports the model in refining its plans, while external actions allow for interaction with real-world sources, such as databases, sensors, or other agents. This real-world feedback helps validate internal reasoning, reducing errors and hallucinations. As discussed in the next section, GenAI solutions for RTL design have increasingly adopted similar approaches, incorporating reasoning and action planning into unified frameworks. As a result, these strategies have enhanced the reliability and quality of RTL code, addressing key challenges in automated hardware design.

## B. GenAI for RTL Design

Chip-Chat [11] probably represents the first attempt at reproducing a fully automated hardware design workflow, from initial design to tapeout, by employing general-purpose LLMs like ChatGPT throughout the entire process. Since then, with the advent of advanced LLMs, zero-shot RTL code generation has significantly improved. These modern models leverage their language capabilities to effectively convert specifications into RTL code, though their performance still falls short compared to what they achieve with other programming languages. Therefore, given the stringent reliability requirements in RTL design, recent efforts have increasingly focused on enhancing the quality-of-results of LLM-driven solutions.

Prior work targets specific phases of RTL generation, such as syntax correction, partial debugging, or specific error handling. Notable approaches include domain-adapted LLMs (e.g., ChipNeMo [1]), data augmentation techniques [12], and Retrieval-Augmented Generation (RAG) [3]. While these methods provide notable improvements, they often lack a unified framework to integrate these advancements into a cohesive LLM-driven pipeline. For instance, RTLFixer [3] introduced a ReAct and RAG-based approach that leverages error logs to iteratively correct syntax errors. Other examples include VeriAssist [4] and VerilogCoder [13], which interface LLMs with simulation tools to enable self-correction and self-verification mechanisms. VeriAssist, however, shows degraded performance when it relies only on self-generated testbenches. VerilogCoder, on the other hand, does not generate testbenches as part of the verification process, which limits its applicability to cases where external validation is already available. More recently, we introduced AIVRIL [7], aiming to provide a more robust framework for autonomously generating and verifying RTL code alongside testbenches. However, the simultaneous generation of RTL and testbenches introduces additional complexity into the overall process.

Building upon [7], the proposed AIVRIL2 framework adopts a testbench-first methodology, where an exhaustive testbench is first self-generated and then leveraged throughout the optimization phases. Moreover, our tool is designed to be language-agnostic, offering versatility across different RTL languages. By incorporating a multi-agent, ReAct-based mechanism, we seamlessly integrate verification processes into the LLM-driven design flow, significantly improving reliability and flexibility. This comprehensive approach addresses current limitations by providing a holistic and adaptable framework for RTL code generation, ultimately enhancing the quality and applicability of LLM-generated designs across various hardware contexts.

#### III. AIVRIL2: EDA-AWARE RTL DESIGN

In this section, we detail the internal mechanisms of the proposed framework: a two-stage LLM-aware RTL design methodology with enforced functional verification. The overall structure is depicted in Fig. 1. The framework is built around two key loops, the *Syntax Optimization* loop and the *Functional Optimization* loop, which are driven by three agents: the *Code Agent*, the *Review Agent*, and the *Verification Agent*. The *Syntax Optimization* loop is supervised by the *Review Agent*, while the *Functional Optimization* loop falls under the purview of the *Verification Agent*, as detailed in the following.

## A. Code Agent

The *Code Agent* acts as the primary code generation component within AIVRIL2, and it is responsible for translating user requirements into functional RTL code. As the only source of code generation throughout the process, it ensures design consistency and coherence. The agent begins by analyzing the user-provided prompt, which outlines the desired RTL design functionality. For reference, Figure 2 illustrates an example involving requirements for a shift register (box denoted with 1). If the prompt lacks sufficient detail, the *Code Agent* initiates an interactive dialogue with the user to gather further information. In practice, this behavior is accomplished via additional ad hoc system prompts. Once a complete prompt is provided, the agent first generates a comprehensive testbench based on the received specifications, aiming to cover all the test cases that a functionally correct RTL design must pass. This represents our self-verification approach, an example of which is illustrated in Figure 2, step 2. This step is critical, as it sets the baseline for the subsequent verification process. Using both the user prompt and the generated testbench as references, the *Code Agent* then produces an initial version of the RTL code (step 3). As shown in the example, this initial code is forwarded to the *Review Agent* for syntax checking and further validation. At each iteration, the *Code Agent* plays a key role in refining the RTL design. It incorporates feedback from other agents in the form of corrective prompts and implicitly manages different versions of the RTL code throughout the iterative process, which facilitates change tracking and enables rollbacks when necessary.

Figure 1. Architecture overview of the proposed AIVRIL2 framework.

## B. Review Agent

The main role of the *Review Agent* is to ensure the syntactical correctness of the generated code. Capable of integrating with any industry-standard RTL compiler, its primary function is to meticulously review the code for syntax errors and provide detailed feedback, distilled from compilation logs, to the *Code Agent*. As illustrated in Figure 2, the *Review Agent* receives the initial code produced by the *Code Agent* based on the user's design specifications. Upon compilation, a comprehensive log file is generated, which serves as the primary input for the *Review Agent*'s analysis. At this stage, the *Review Agent* performs an in-depth examination of the output. It analyzes the logs for any signs of syntax errors, carefully parsing the information to identify specific issues within the code. This analysis goes beyond simple error detection: the agent also identifies the exact locations of errors by extracting line numbers and relevant code snippets from the log file, with the ultimate goal of converting them into corrective prompts. In the example shown in Figure 2, the *Review Agent*'s analysis finds no syntax errors in the initial code (step 4). This successful syntax check allows the process to move directly to the functional verification stage. On the other hand, if syntax errors are detected, the agent generates a highly detailed and actionable corrective prompt. This prompt provides a comprehensive breakdown of each syntax error, including the exact line numbers where the errors occur, relevant code snippets surrounding the problematic areas, and potential suggestions or hints for resolving the syntax issues. The level of detail provided is crucial, as it allows the *Code Agent* to quickly identify and correct syntax issues in the minimum number of iterations. The *Review Agent* continues this process iteratively, working in close coordination with the *Code Agent* until the code achieves syntactical correctness. This iterative refinement ensures that the code progresses to the functional verification stage only after it has been thoroughly validated for syntax errors.
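As a rough illustration of how a compilation log could be distilled into a corrective prompt, the Python sketch below extracts error locations and surrounding code snippets. It is not the AIVRIL2 implementation: the log line format (`ERROR: [<id>] <message> [<file>:<line>]`), the regular expression, and the names `SyntaxIssue`, `parse_compile_log`, `snippet`, and `build_corrective_prompt` are all assumptions made for this example, and a real compiler log may need a different pattern.

```python
import re
from dataclasses import dataclass

# Assumed log format: "ERROR: [<id>] <message> [<file>:<line>]".
# Real compilers differ; adapt the pattern to the tool in use.
ERROR_RE = re.compile(
    r"^ERROR:\s*\[(?P<code>[^\]]+)\]\s*(?P<msg>.*?)\s*"
    r"\[(?P<file>[^\]:]+):(?P<line>\d+)\]\s*$"
)


@dataclass
class SyntaxIssue:
    code: str      # tool-specific message identifier
    message: str   # human-readable error text
    file: str      # source file named in the log
    line: int      # 1-based line number of the error


def parse_compile_log(log: str) -> list[SyntaxIssue]:
    """Collect every error line that matches the assumed format."""
    issues = []
    for raw in log.splitlines():
        m = ERROR_RE.match(raw.strip())
        if m:
            issues.append(
                SyntaxIssue(m["code"], m["msg"], m["file"], int(m["line"]))
            )
    return issues


def snippet(source: str, line: int, context: int = 1) -> str:
    """Return the offending line plus `context` lines on each side."""
    lines = source.splitlines()
    lo = max(line - 1 - context, 0)
    hi = min(line + context, len(lines))
    return "\n".join(f"{i + 1:>4} | {lines[i]}" for i in range(lo, hi))


def build_corrective_prompt(source: str, issues: list[SyntaxIssue]) -> str:
    """Turn parsed issues into a prompt; an empty string means 'no errors'."""
    if not issues:
        return ""
    parts = ["The compiler reported the following syntax errors. "
             "Fix them and return the complete file."]
    for n, issue in enumerate(issues, start=1):
        parts.append(
            f"{n}. {issue.file}, line {issue.line}: "
            f"{issue.message} ({issue.code})\n{snippet(source, issue.line)}"
        )
    return "\n\n".join(parts)
```

Returning an empty prompt when no errors are found gives the caller a simple condition for leaving the syntax loop and moving on to functional verification.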

## C. Verification Agent

Figure 2. Practical example of the proposed workflow and internal state representation of the agents in AIVRIL2.

The *Verification Agent* represents the final stage in our framework and is responsible for ensuring the functional correctness of the RTL design. This agent triggers once both the RTL code and the testbench have been validated as syntactically correct by the *Review Agent*. The primary objective of the *Verification Agent* is to verify that the generated RTL code passes all the test cases outlined in the testbench, as generated at step 3 or 6, depending on the stage of the process. As also shown in Figure 2, the *Verification Agent*'s workflow begins with the simulation process. The agent then analyzes the resulting simulation logs in order to detect any discrepancies between the expected and the actual outputs. In the given example, the initial simulation identifies a functional error: "Test Case 2 Failed: shift\_ena should be 0 after 4 clock cycles," as detailed in step **⑤**. Based on this analysis, the *Verification Agent* generates a corrective prompt to guide the *Code Agent* in resolving the functional issues identified during the simulation (steps **⑥** and **⑦**). A critical aspect of this verification workflow is the consistent use of the same testbench throughout all iterations. While the RTL code may undergo multiple revisions based on feedback, the testbench remains unchanged. This approach ensures a standardized and unbiased evaluation of each RTL version, allowing for precise tracking of improvements and consistent verification against the original functional requirements. The *Verification Agent* operates in synergy with the *Code Agent*, providing feedback and receiving updated RTL designs until either all test cases pass successfully or a predefined maximum number of iterations is reached. In the example, this iterative process is highlighted when, after re-verification following the *Code Agent*'s refinements, the output log confirms that "All tests passed successfully!" (step **⑧**). This outcome indicates that the RTL now fully satisfies the user's requirements, demonstrating the efficiency of the feedback loop between the *Verification Agent* and the *Code Agent*.
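The interplay between the two optimization loops can be summarized with a minimal Python sketch. It is only an approximation of the flow described in this section: the function `refine`, its three callables (hypothetical stand-ins for the LLM-backed agents and the EDA tool invocations), the prompt strings, and the default iteration limit are assumptions introduced for illustration and are not taken from the AIVRIL2 implementation.

```python
from typing import Callable

# Hypothetical stand-ins for the agents. Each feedback function returns a
# corrective prompt, or an empty string when nothing needs fixing.
CodeAgent = Callable[[str], str]               # prompt -> generated code
ReviewAgent = Callable[[str], str]             # rtl -> syntax feedback
VerificationAgent = Callable[[str, str], str]  # (rtl, testbench) -> feedback


def refine(
    spec: str,
    code_agent: CodeAgent,
    review_agent: ReviewAgent,
    verification_agent: VerificationAgent,
    max_iters: int = 5,  # assumed limit; the paper does not state its value
) -> tuple[str, bool]:
    """Return the last RTL version and whether it passed all tests."""
    # Testbench-first: the testbench is generated once and never modified.
    testbench = code_agent(f"Write a testbench for: {spec}")
    rtl = code_agent(f"Write RTL for: {spec}\nTestbench:\n{testbench}")

    for _ in range(max_iters):
        # Syntax Optimization loop: iterate until the compiler is satisfied.
        for _ in range(max_iters):
            feedback = review_agent(rtl)
            if not feedback:
                break
            rtl = code_agent(feedback)
        else:
            return rtl, False  # syntax never converged

        # Functional Optimization step against the fixed testbench.
        feedback = verification_agent(rtl, testbench)
        if not feedback:
            return rtl, True   # all test cases passed
        rtl = code_agent(feedback)

    return rtl, False          # iteration budget exhausted
```

Every functionally revised RTL version is sent back through the syntax check before it is simulated again, so the simulator only ever sees code that compiles.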

#### IV. EXPERIMENTAL RESULTS

In this section, we present the experimental evaluation of the proposed AIVRIL2 framework. Our goal is to rigorously assess the performance and robustness of the tool across a diverse set of benchmarks, ensuring a thorough and unbiased analysis of its capabilities. To achieve this, we selected key evaluation metrics that emphasize the strengths of our approach in realistic scenarios. Specifically, we focused on metrics that address both syntactical and functional correctness, providing a comprehensive perspective on the effectiveness of our solution in handling complex design tasks.

Table I. Summary of pass-rate results, with column $\Delta_{\mathcal{F}}$ showing the percentage improvement of the proposed technique over the corresponding baseline model in terms of functional pass rate. All values are expressed as percentages.

| Technology                  | Verilog pass@$1_{\mathcal{S}}$ | Verilog pass@$1_{\mathcal{F}}$ | Verilog $\Delta_{\mathcal{F}}$ | VHDL pass@$1_{\mathcal{S}}$ | VHDL pass@$1_{\mathcal{F}}$ | VHDL $\Delta_{\mathcal{F}}$ |
|-----------------------------|--------|-------|-------|-------|-------|-------------|
| Llama3-70B                  | 71.15  | 37.82 | -     | 1.28  | 0     | -           |
| GPT-4o                      | 71.79  | 51.29 | -     | 39.1  | 27.56 | -           |
| Claude 3.5 Sonnet           | 91.03  | 60.23 | -     | 88.46 | 53.85 | -           |
| AIVRIL2 (Llama3-70B)        | 100    | 55.13 | 45.76 | 58.87 | 32.69 | N/A         |
| AIVRIL2 (GPT-4o)            | 100    | 72.44 | 41.23 | 100   | 59.62 | 116.32      |
| AIVRIL2 (Claude 3.5 Sonnet) | 100    | 77    | 27.84 | 100   | 66    | 22.56       |
| Average                     |        |       | 38.28 |       |       | **≫ 69.44** |

## A. Methodology

We employed all 156 benchmarks from the VerilogEval-Human benchmark suite [5] in our experiments, which enabled us to encompass a broad range of design complexities. For both optimization loops, performance was evaluated using the unbiased pass@$k$ estimator (with $k=1$), as described in [14]. We distinguish between pass@$k_{\mathcal{S}}$, which represents the success rate of designs passing all syntax checks, and pass@$k_{\mathcal{F}}$, which reflects the success rate of designs that are not only syntactically correct but also functionally accurate. Notably, pass@$k_{\mathcal{F}}$ was determined by executing the testbenches provided in the benchmark suite, ensuring a comprehensive validation of the overall approach.
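For reference, the unbiased estimator of [14] computes, for each problem with $n$ generated samples of which $c$ are correct, the quantity $1 - \binom{n-c}{k} / \binom{n}{k}$, and then averages it over all problems. The short Python sketch below implements this formula. The function names are illustrative, and the number of samples $n$ drawn per benchmark is a free parameter here, since it is not stated in this section.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # in which case every size-k subset contains a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems given (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For $k=1$ the expression reduces to $c/n$, i.e., the fraction of correct samples per problem, averaged over the benchmark suite.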

For both the syntax check and functional simulation stages, we utilized Vivado Design Suite - HLx Editions 2018.1, which readily supports mixed-language simulations. To gain broader insights into the capabilities of various LLMs in generating RTL, we employed different models for the agents: Claude 3.5 Sonnet [15], GPT-4o [16], and Llama3-70B [17]. All models were used without any fine-tuning or RAG integration. Additionally, the *temperature* and *top\_p* parameters for each LLM were set to 0.2 and 0.1, respectively.

## B. Results & Discussion

Table I reports the experimental results. The table is organized as follows: for each of the two target RTL languages, we report the syntax pass rate (column pass@$1_{\mathcal{S}}$), the functional pass rate (column pass@$1_{\mathcal{F}}$), and the improvement, in terms of functional pass rate, of AIVRIL2 w.r.t. the corresponding baseline LLM (column $\Delta_{\mathcal{F}}$). As the data suggest, our framework achieves a pass@$1_{\mathcal{S}}$ of 100% in all configurations except one: Llama3-70B for VHDL, which attained a pass@$1_{\mathcal{S}}$ of only 58.87%. While not perfect, this result still demonstrates the efficacy of the Syntax Optimization loop, considering that the baseline model only achieved a pass@$1_{\mathcal{S}}$ of 1.28%, which corresponds to a $46\times$ improvement in syntax pass rate. This outcome highlights a significant gap in Llama3-70B's foundation knowledge concerning VHDL design, likely due to the scarcity of VHDL source code included in its training data. Consequently, the baseline pass@$1_{\mathcal{F}}$ for Llama3-70B is 0%. However, even in this case, AIVRIL2 was able to recover functional correctness in the analyzed VHDL designs, achieving a pass@$1_{\mathcal{F}}$ of 32.69%. More broadly, our approach improved the functional pass rate of the baseline LLMs by 38.28% for Verilog and at least 69.44% for VHDL, on average. These results highlight the effectiveness of the proposed framework in enhancing both syntactical and functional correctness of RTL designs across all LLMs and RTL languages. The significant improvements in VHDL, despite the models' initial disadvantages, underscore the capability of AIVRIL2 to boost performance across different hardware description languages, even when the training data for the underlying model is imbalanced.
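The $\Delta_{\mathcal{F}}$ values in Table I are consistent with a plain relative improvement over the baseline, $100 \cdot (p_{\text{AIVRIL2}} - p_{\text{base}}) / p_{\text{base}}$. The following Python sketch shows this arithmetic; the function name is illustrative, and returning `None` for a zero baseline is an assumption that mirrors the N/A entry for Llama3-70B on VHDL.

```python
from typing import Optional


def relative_improvement(baseline: float, improved: float) -> Optional[float]:
    """Percentage improvement of `improved` over `baseline`.

    Returns None when the baseline is zero, where the ratio is undefined
    (reported as N/A in Table I).
    """
    if baseline == 0:
        return None
    return 100.0 * (improved - baseline) / baseline


# Verilog functional pass rates from Table I: (baseline, AIVRIL2).
verilog = [(37.82, 55.13), (51.29, 72.44), (60.23, 77.0)]
deltas = [relative_improvement(b, a) for b, a in verilog]
average = sum(deltas) / len(deltas)
```

Recomputing from the rounded pass rates reproduces the tabulated values to within about 0.01 percentage points, a difference that can be attributed to rounding of the inputs.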



Figure 3. Average latency breakdown across optimization loops for the proposed framework. Reported figures account for the execution times of EDA tools.

Architectural differences among LLMs often lead to varying execution times. This is particularly important to consider when applying optimization loops around LLMs, as it helps validate the practicality of our approach in real-world scenarios. Figure 3 illustrates the average latency for different LLMs, as well as a breakdown of the execution times for both optimization loops across all considered LLM configurations. The most significant latency increase was observed with Llama3-70B when generating VHDL. As the plot suggests, the latency gap from the baseline in this case is approximately $6\times$ (i.e., 6.68 vs. 39.29 seconds). This is partly due to the higher number of iterations required by Llama3-70B to converge towards a solution. More specifically, Llama3-70B required an average of 3.95 cycles for the Syntax Optimization loop and 4.7 cycles for the Functional Optimization loop to converge. In contrast, the smallest latency increase was recorded with Claude 3.5 Sonnet for Verilog generation, which showed roughly a $2\times$ increase in execution time. This configuration required an average of 2 cycles for the Syntax Optimization loop and 3 cycles for the Functional Optimization loop. Overall, while some latency gaps may appear significant, it is worth emphasizing that the worst-case average latency introduced by our approach did not exceed 42 seconds. This is a reasonable trade-off, considering the substantial time saved in avoiding potential manual debugging and verification. Another noteworthy observation is the latency recorded for Claude 3.5 Sonnet during the Functional Optimization loop for VHDL, which was the highest among all solutions. This increased latency can be attributed to the LLM's own processing, particularly due to the higher complexity induced by corrective prompts.

## C. Comparison with State-of-the-Art Approaches

As already discussed in Section II-B, recent frameworks and fine-tuned LLMs have been introduced to enhance RTL code quality within the context of GenAI solutions. Table II provides a comparison between our proposed framework and existing solutions. For each technique, we report the license associated with the adopted LLM and the pass@$1_{\mathcal{F}}$ metric. Due to limited data availability, our comparison focuses on Verilog generation only since, to the best of the authors' knowledge, this work is among the first to evaluate GenAI solutions for VHDL. As shown in the table, our approach outperforms existing solutions in both open-source and closed-source regimes. Most notably, the largest performance gap is recorded w.r.t. ChipNeMo-13B [1], where our solution achieves a $3.4\times$ higher pass@$1_{\mathcal{F}}$ (77% vs. 22.4%). These results further highlight the strength of AIVRIL2 in achieving state-of-the-art performance in RTL generation.

Table II. Comparison of state-of-the-art RTL generation techniques. Column pass@$1_{\mathcal{F}}$ only reports numbers for Verilog.

| Technology                  | Model License | pass@$1_{\mathcal{F}}$ (%) |
|-----------------------------|---------------|----------------------------|
| Llama3-70B [17]             | Open Source   | 37.82                      |
| CodeGen-16B [18]            | Open Source   | 41.9                       |
| CodeV-CodeQwen [6]          | Open Source   | 53.2                       |
| ChipNeMo-13B [1]            | Closed Source | 22.4                       |
| ChipNeMo-70B [1]            | Closed Source | 27.6                       |
| CodeGen-16B-Verilog-SFT [5] | Closed Source | 28.8                       |
| RTLFixer [3]                | Closed Source | 36.8                       |
| VeriAssist [4]              | Closed Source | 50.5                       |
| GPT-4o [16]                 | Closed Source | 51.29                      |
| Claude 3.5 Sonnet [15]      | Closed Source | 60.23                      |
| AIVRIL [7]                  | Closed Source | 67.3                       |
| AIVRIL2 (Llama3-70B)        | Open Source   | 55.13                      |
| AIVRIL2 (GPT-4o)            | Closed Source | 72.44                      |
| AIVRIL2 (Claude 3.5 Sonnet) | Closed Source | 77                         |

#### V. CONCLUSIONS

In this work, we introduced AIVRIL2, a novel, self-verifying, LLM-aware RTL design framework that combines a *Syntax Optimization* loop and a *Functional Optimization* loop to enhance syntactical and functional correctness. Experimental results demonstrated that our framework significantly improves code quality across a wide range of benchmarks, outperforming baseline models in both Verilog and VHDL generation and prior solutions in Verilog generation. Despite some added latency, the overall execution times remained reasonable, making the trade-off worthwhile given the reduction in manual verification. These results highlight the robustness and versatility of the proposed framework, paving the way for future advancements in automated GenAI for RTL.

#### REFERENCES

- [1] M. Liu, T.-D. Ene, R. Kirby, C. Cheng, N. Pinckney, R. Liang, J. Alben, H. Anand, S. Banerjee, I. Bayraktaroglu et al., "Chipnemo: Domain-adapted llms for chip design," arXiv preprint arXiv:2311.00176, 2023.
- [2] H. Wu, Z. He, X. Zhang, X. Yao, S. Zheng, H. Zheng, and B. Yu, "Chateda: A large language model powered autonomous agent for eda," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, 2024.
- [3] Y. Tsai, M. Liu, and H. Ren, "Rtlfixer: Automatically fixing rtl syntax errors with large language models," arXiv preprint arXiv:2311.16543, 2023.
- [4] H. Huang, Z. Lin, Z. Wang, X. Chen, K. Ding, and J. Zhao, "Towards llm-powered verilog rtl assistant: Self-verification and self-correction," arXiv preprint arXiv:2406.00115, 2024.
- [5] M. Liu, N. Pinckney, B. Khailany, and H. Ren, "VerilogEval: evaluating large language models for verilog code generation," in 2023 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2023.
- [6] Y. Zhao, D. Huang, C. Li, P. Jin, Z. Nan, T. Ma, L. Qi, Y. Pan, Z. Zhang, R. Zhang et al., "Codev: Empowering llms for verilog generation through multi-level summarization," arXiv preprint arXiv:2407.10424, 2024.
- [7] M. ul Islam, H. Sami, P.-E. Gaillardon, V. Tenace et al., "Aivril: Ai-driven rtl generation with verification in-the-loop," arXiv preprint arXiv:2409.11411, 2024.
- [8] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., "Chain-of-thought prompting elicits reasoning in large language models," Advances in neural information processing systems, vol. 35, pp. 24824–24837, 2022.
- [9] W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, "Language models as zero-shot planners: Extracting actionable knowledge for embodied agents," in *International Conference on Machine Learning*. PMLR, 2022, pp. 9118–9147.
- [10] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, "React: Synergizing reasoning and acting in language models," arXiv preprint arXiv:2210.03629, 2022.
- [11] J. Blocklove, S. Garg, R. Karri, and H. Pearce, "Chip-chat: Challenges and opportunities in conversational hardware design," in 2023 ACM/IEEE 5th Workshop on Machine Learning for CAD (MLCAD). IEEE, 2023, pp. 1–6.
- [12] K. Chang, K. Wang, N. Yang, Y. Wang, D. Jin, W. Zhu, Z. Chen, C. Li, H. Yan, Y. Zhou et al., "Data is all you need: Finetuning llms for chip design via an automated design-data augmentation framework," arXiv preprint arXiv:2403.11202, 2024.
- [13] C.-T. Ho, H. Ren, and B. Khailany, "Verilogcoder: Autonomous verilog coding agents with graph-based planning and abstract syntax tree (ast)-based waveform tracing tool," arXiv preprint arXiv:2408.08927, 2024.
- [14] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., "Evaluating large language models trained on code," arXiv preprint arXiv:2107.03374, 2021.
- [15] Anthropic, "Claude 3.5 Sonnet Model Card Addendum," 2024.
- [16] OpenAI, "GPT-4o System Card," 2024.
- [17] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., "The llama 3 herd of models," arXiv preprint arXiv:2407.21783, 2024.
- [18] S. Thakur, B. Ahmad, Z. Fan, H. Pearce, B. Tan, R. Karri, B. Dolan-Gavitt, and S. Garg, "Benchmarking large language models for automated verilog rtl code generation," in 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2023, pp. 1–6.